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Abstract 

Background: The identification of robust lists of molecular biomarkers related to a disease is a fundamental step 
for early diagnosis and treatment. However, methodologies for biomarker discovery using microarray data often 
provide results with limited overlap. It has been suggested that one reason for these inconsistencies may be that 
in complex diseases, such as cancer, multiple genes belonging to one or more physiological pathways are 
associated with the outcomes. Thus, a possible approach to improve list stability is to integrate biological 
information from genomic databases in the learning process; however, a comprehensive assessment based on 
different types of biological information is still lacking in the literature. In this work we have compared the effect of 
using different biological information in the learning process like functional annotations, protein-protein 
interactions and expression correlation among genes. 

Results: Biological knowledge has been codified by means of gene similarity matrices and expression data linearly 
transformed in such a way that the more similar two features are, the more closely they are mapped. Two 
semantic similarity matrices, based on Biological Process and Molecular Function Gene Ontology annotation, and 
geodesic distance applied on protein-protein interaction networks, are the best performers in improving list 
stability maintaining almost equal prediction accuracy. 

Conclusions: The performed analysis supports the idea that when some features are strongly correlated to each 
other, for example because are close in the protein-protein interaction network, then they might have similar 
importance and are equally relevant for the task at hand. Obtained results can be a starting point for additional 
experiments on combining similarity matrices in order to obtain even more stable lists of biomarkers. The 
implementation of the classification algorithm is available at the link: http://www.math.unipd.it/dasan/biomarkers. 
html. 



Background support vector machines among others, have been used 

Analysis of gene expression from microarray experi- on gene expression data, especially in cancer studies 

ments has been widely used for the development of new [1,2]. 

physiological hypotheses useful for answering to both In these studies, the biological interest is mainly 

diagnostic and prognostic questions. In the last decade, focused on biomarker discovery, i.e. in finding those 

supervised classification analysis has experienced a large genes and proteins which can be used as diagnostic/ 

diffusion to address this task and several different meth- prognostic markers for the disease. Biomarkers provide 

ods like discriminant analysis, random forests and useful insight for a deeper and more detailed under- 
standing of the biological processes involved in the spe- 
cific pathology and might represent the targets for drug 

development [3]. Although high accuracy is often 
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patients have few genes in common [4,5], whereas bio- 
marker reproducibility is fundamental for clinical and 
pharmaceutical applications. Several works have recently 
pointed out that high reproducibility of biomarker lists 
is equally important as high classification accuracy [6,7]. 

In general, there are two stability issues arising in gene 
expression classification and analysis. Firstly, since train- 
ing data are often scarce, predictive models obtained 
from different datasets can be extremely different. Sec- 
ondly, since the number of features is generally very 
high, then features can be combined in many different 
ways to give solutions able to explain the data. As a 
consequence of this, many possible sets of features can 
be considered relevant to the task and equally good in 
terms of the accuracy. This characteristic makes the 
process of selecting the set of relevant features for a 
classification task a very hard problem. 

Bootstrap methods have been demonstrated helpful in 
addressing the first issue. In these approaches different 
classifiers are generated and different features (lists of 
biomarkers) are selected on different splits of data and 
the results are somehow averaged [8,9], thus preserving a 
high ranking only to those features that are consistently 
the most discriminating features over the splits [10,11]. 
However, this method does not solve the problem of the 
instability due to the high number of features. In fact, the 
crucial problem is that the classification task is under 
constrained. To address this issue, additional information 
available on the relationships between genes should be 
used to improve the stability with respect to the features 
of the classifiers. The basic idea of this strategy is to take 
into account the complex gene relationships, instead of 
considering genes as independent features. Several efforts 
in this direction have been recently presented in the lit- 
erature. In [12], pathway information has been incorpo- 
rated into the biomarker discovery process using 
available protein-protein interaction networks and con- 
sidering subnetworks as features. Logistic regression 
models have been applied on expression profiles of two 
cohorts of breast cancer patients and results have been 
assessed in terms of both agreement between subnet- 
works identified in the two datasets and classification 
accuracy. In [13-17] topological properties of Kyoto 
Encyclopedia of Genes and Genomes (KEGG) pathways 
[18] or networks reconstructed from gene expression 
data have been used to constrain the learning process. In 
particular, [16,17] use regularization and integrate prior 
knowledge defining KEGG pathway based penalty terms. 
The use of Gene Ontology (GO) [19] as prior informa- 
tion has been explored in [20], where the authors pro- 
pose a classification model based on functional groups of 
genes. All the above methods have focused on prediction 
performance, without considering in a systematic way the 
stability issue. Recent works have started considering the 



problem of biomarker list stability [21], but an overview 
of the ability of different sources of biological knowledge 
to improve the reproducibility of the results is not 
already available in the literature. 

Our work addresses the integration of prior knowledge 
in the learning process and, differently from previous 
works, compares the performance of different sources of 
prior knowledge. In particular, we propose a standardized 
way to incorporate in the kernel different types of biologi- 
cal knowledge like functional annotations, protein-protein 
interactions, and expression correlation among genes, with 
the only constraint that the information is codified by a 
similarity matrix. The feature space is then transformed 
such that the more similar two features are, the more clo- 
sely they are mapped. Similarity matrices are defined using 
metrics which are specific for each type of biological infor- 
mation used: semantic similarities [22] for the annotations 
on GO; topology-based similarity measures [23,24] for 
protein-protein interactions (PPI) extracted from Human 
Protein Reference Database (HPRD) [25]; pair- wise corre- 
lation and mutual information for gene expression data. A 
linear classifier resembling the Bayes Point Machine [26] 
is used as classification tool. The vector of weights pro- 
duced by this algorithm is used to rank the features and 
obtain a list of biomarkers. 

Differently from approaches that integrate different 
datasets [27] by combining kernels [28] to improve classi- 
fication performance and robustness of the results thus 
considering a different and maybe complementary aspect 
of the problem, our approach addresses the integration 
of prior knowledge in the learning process. It provides a 
standardized way to incorporate different types of biolo- 
gical knowledge in the kernel, with the only constraint 
that the information is codified by a similarity matrix, 
thus it can be used with any kernel method. 

As above mentioned, in this work we also compare 
the performance of different sources of prior knowledge 
and evaluate the performance using three real datasets 
from different studies exploring the same clinical classi- 
fication task. The assessment of the results obtained for 
different similarity matrices is based on the trade-off 
between predictive accuracy and feature ranking stabi- 
lity, measured using the Canberra distance [7]. In fact, 
the introduction of constraints in the feature space 
might lead to robust biomarker lists but poor discrimi- 
nation between the classes. Finally, we have evaluated 
the ability of different biological information to map the 
features on new feature spaces where the classes are 
more naturally separable. 

Methods 

Gene expression data 

Publicly available data from three breast cancer microar- 
ray studies were collected from Gene Expression 
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Omnibus (GEO) database [29] with accession numbers: 
GSE2990 [30], GSE3494 [31] and GSE7390 [32]. Data- 
sets were all hybridized using Affymetrix U133 Gene- 
chips™ (HG-U133A). Breast cancer has been 
extensively studied in the literature and the Estrogen 
receptor (ER) status is the most important prognostic 
factor as indicator of response to endocrine therapy 
[33]. Estrogen receptor 1 (ESR1) is the gene more 
directly associated with ER status and can mask other 
potential descriptors of the underlying pathophysiology 
[34]; therefore, probesets related to ESR1 were removed 
from all datasets. Only tamoxifen-untreated subjects 
were selected. Since there are subgroups of samples 
belonging to multiple datasets, redundant subjects were 
removed. Quality assessment of the raw data from each 
dataset was performed using the arrayQualityMetrics 
package in Bioconductor [35]. Any array failing quality 
controls on MA plots, box-plots and between- array dis- 
tances was not considered. Affymetrix chip definition 
files were used to annotate the arrays, resulting in 22207 
features. Gene expression intensity signal was derived 
and normalized independently for each dataset using 
robust multiarray average (RMA) algorithm [36]. The 
resulting datasets are described in Table 1. 

Integration of prior knowledge in the learning process 

Expression data are given as very high dimensional vec- 
tors of measurements. The high dimensionality makes 
the task of biomarkers discovery very hard. This is espe- 
cially due to the fact that the task is under constrained. 
In this paper, we propose to perform a linear transfor- 
mation of the examples (i.e. the biological samples) in a 
way that classifiers computed on transformed examples 
have a higher stability, hopefully preserving the accu- 
racy. This transformation is made by using prior biologi- 
cal information about genes in a way to maintain the 
structure of the problem. 

In this section we first introduce linear classifiers and 
some relevant facts about the embedding of data into 
feature spaces. Then, we describe our intuition and 
describe an algorithm that implements it. 

In the following we denote by { \ the exam- 

ples, i.e. the N dimensional vectors of expression data 
obtained for M subjects, where N is the number of genes. 
Each example has associated a binary label y m (m = 1,..., 
M) having values in 



Table 1 Breast cancer datasets used for the classification 



Datasets 


Samples 


ER+ samples 


ER- samples 


GSE2990 


116 


83 


33 


GSE3494 


155 


131 


24 


GSE7390 


152 


103 


49 



Linear classifiers 

We focus on linear classifiers which are simple and gen- 
erally perform well on gene expression analysis. Given a 
linearly separable classification task, there are in general 
infinitely many linear classifiers (hyper-planes) that can 
correctly classify the examples. This set is commonly 
called the version space. When the number of features is 
very high, the version space tends to have a large 
volume. Formally, the version space for linear classifiers 
can be defined as: 

V = {w\y m (w • x m ) > 0, for each m = 1, ...M, \\w\\ = 1} (1) 

Without any loss in generality, we consider weights of 
unitary norm. A very popular algorithm to find a linear 
classifier which correctly separates the training examples 
(i.e. an element of the version space) is the Perceptron 
algorithm [37] which can be briefly described as in the 
following. We assume the training vectors x and w are 
of size n, and w is initially set to the zero vector. The 
algorithm runs in epochs. On each epoch all the training 
examples x b for i = 1,.., M, are presented to the algo- 
rithm and the vector w is updated whenever the asso- 
ciated classifier makes a mistake on x- v i.e. if (y t sign(w • 
Xj) < 0) then w = w+jiXi. When the training set is line- 
arly separable, the perceptron is guaranteed to even- 
tually converge to a vector (hyper-plane) which correctly 
separates the training data, i.e. the solution is an ele- 
ment of the version space. 

It can be shown that other kernel based algorithms, 
like for example the hard version of Support Vector 
Machines (SVM), whose description is beyond the scope 
of the paper, also have solutions in the version space. In 
the particular case of SVM this solution is in fact unique 
and is the one which maximizes the margin on the 
training set [38]. As shown in [26] the center of mass of 
the version space, the so called Bayes point (Bp), would 
be the optimal choice, even better than SVM (which by 
the way can be considered an approximation of the Bp), 
with nice theoretical properties in terms of its generali- 
zation ability. An algorithm that approximates this opti- 
mal Bp solution, the so called Bayes point machine, has 
been proposed, which considers the average of the solu- 
tions of several runs of the perceptron. Our algorithm, 
which is presented in the following, is based on a var- 
iant of the Bayes point machine. 

Note that when a feature space is characterized by 
high dimensionality and the features are considered 
independent, i.e. there are many more variables (fea- 
tures) than constraints (examples), the task is under 
constrained. This often implies that the version space 
volume is large and can change extremely both in form 
and size depending on which examples are used for 
training. It is clear that this produces instability. We will 
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see in the following how to add available domain knowl- 
edge to introduce structural constraints in the problem 
in order to improve robustness of a linear classifier. 
Feature ranking 

The values w t of a linear classifier represent the degree 
of importance and the bias that a given feature i pro- 
vides to the decision. High positive (negative) values tell 
us that such feature is important to classify an instance 
as positive (negative). For this reason, the absolute value 
of the weights can also be used as a criterion for feature 
ranking. 

Similarity matrix integration 

When prior knowledge is available providing informa- 
tion about gene-gene similarity, this knowledge can be 
effectively used by mapping examples into a feature 
space where linear solutions preserve these similarities. 

Consider a linear transformation of the data via a 
matrix P, i.e. (p(x) - Px. Now, can we say something 
about the desirable properties of q> which make the task 
of discriminating positive versus negative examples sim- 
ple enough in the target space? It is well known that a 
measure of the goodness of an embedding is the ratio 
between the maximal norm R> the highest norm (or 
length) of any example x m) and the margin y of the 
examples, namely G = (R/y) 2 . For separable data, the 
margin is defined as the distance between the optimal 
separating hyper-plane (SVM) and the examples. In the 
case of perceptron classifiers, the value G is also related 
to the number of mistakes the perceptron algorithm 
makes to converge [37]. These considerations seem to 
indicate that the margin of transformed examples should 
be large in order to get high performance. However, 
when the expected margin (or equivalently, the expected 
volume of the version space) is too large, it generally 
leads to unstable solutions for small datasets. A solution, 
which represents a trade-off between these two (appar- 
ently) opposite goals, is to choose an embedding of data 
where norm of vectors are as small as possible but data 
remain linearly separable. Specifically, we propose to 
make a linear embedding of data via a bi-stochastic 
matrix. We focus on stochastic matrices because they 
have the property to map vectors x into shorter ones 
(compression) and thus to make the maximal norm R of 
target examples smaller (this is due to the fact that the 
eigenvalues of a stochastic matrix are all in [0, 1]). As 
we have previously seen, this together with large margin 
solutions guarantees a good performance of the 
embedding. 

Let S be a symmetric similarity matrix with elements 
in [0, 1] with l's in the diagonal, the associated stochas- 
tic matrix P is obtained as in the following: 

P = D-\l + a{S-I)) (2) 



where / is the identity matrix, D is a diagonal matrix 
with elements corresponding to sums of elements in the 
rows/columns of (I+a(S-I))> and a > 0 is a tuning para- 
meter. Note that when a = 0, we have P = I and the fea- 
ture space coincides with the original space. The 
parameter a is fixed according to the best stability per- 
formance, measured by the Canberra distance (Equation 
15). 

Now, let be given a perceptron-like solution in the tar- 
get space, then the weight vector can be expressed as a 
weighted sum of the examples in feature space, namely 
w = ^ An(p(Jm)> and the following holds: 

I w i - w i I = | J2 m J2 k p ^ x mk - J2 m a* H k | = 

\^m (3) 

= |(P<-Pj)-fc| <c|(Pi-Pi)| 

where c > 0 is a constant which does not depend on 
indices i and Thus, we can see the matrix Pas a cod- 
ing matrix for genes. Specifically, the i-th gene is codi- 
fied by P t . This result shows that when two genes have 
similar codes, the difference in the weight vector cannot 
be too large. 

It is important to note that this result does not imply 
that the same gene will have the same position in the 
ranking generated by two independent experiments, i.e. 
that the same biomarkers will be selected. The result 
above simply means that the relative position of two 
similar genes will be similar in the two experiments. 
However, if the matrix P contains reliable information, 
this should hopefully produce similar lists of biomarkers. 
Classification algorithm and biomarker list generation 
The proposed algorithm is based on the perceptron 
algorithm and resembles the Bayes point machine. The 
algorithm starts by mapping data using the matrix P. 
The transformed data are standardized by subtracting 
from each gene expression value its mean across the 
samples and dividing by its standard deviation. Then, 
data are randomly split (70% training, 30% test) for a 
number T = 1000 of times. For each one of these splits 
a run of the perceptron algorithm is performed on its 
training data (to increase randomization data are also 
shuffled before each perceptron epoch). Thus, for each 
split t, a weight vector w t is obtained and normalized to 
unitary norm. For each split, the accuracy a t is also eval- 
uated with respect to the test partition. The final solu- 
tion is obtained as the average of weight vectors w t , i.e. 
W = AVE(w t ). 

Note that the expected accuracy of W on new unseen 
examples can also be estimated by using available data 
with the following method. Let Q be the design matrix 
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with entries Q tm = 1 if the example x m is in the training 
partition of split t, and 0 otherwise. For each example 
x m a predictor W(m) = AVE(w t ) is built using just the 
weights w t such that Q tm - 0, i.e. we take the average of 
the weight vectors for the construction of which the 
example x m was not used. Finally, the classifier W(m) is 
tested against x m . It is easy to see that the accuracy we 
observe applying this method on all available data is an 
estimate of the expected accuracy of W. The list of bio- 
markers returned by the algorithm is the list of genes 
ordered according to the absolute value of their corre- 
spondent value in W. 

The method described above can also be seen as a 
leave-one-out estimate of the accuracy. However, the 
same method can be easily adapted to a (/c-fold) cross- 
validation type of analysis. In this case, the overall pro- 
cedure would change as in the following: (i) Split data 
in k sets X lf .., X k ; (ii) Train models W 1} .., W m , where W t , 
t = 1, ..,/<, is learned on the set X\X t , with the method 
presented above, and get the accuracy ACC(X t ) on the 
set X t ; (iii) Evaluate the overall accuracy as the average 
of these partial accuracy estimates. 

The advantage of using a /c-fold type of analysis 
instead of the leave-one-out type of analysis is its lower 
variance for small samples. The disadvantage is that the 
method is more computational demanding. We have 
done some experiments using both methods and we 
have not observed big differences in the obtained results 
with our data. 

Similarity matrices 

Three different kinds of data were considered as prior 
knowledge to be integrated in the feature ranking: 1) 
Gene Ontology functional annotations; 2) the network 
of protein-protein interactions; 3) gene expression pro- 
files from a collection of breast cancer studies. All these 
data were used to calculate different kinds of similarity 
measures sy between pairs of features i and j based on: 

♦ Semantic similarity of functional annotations; 

♦ Topological similarity in the network of protein-pro- 
tein interactions; 

♦ Correlation between gene expression profiles. 

The corresponding similarity matrix S for N variables 
is the symmetric N-N matrix whose element s t j refers to 
the similarity between the features i and 

In the following, the methods for codifying the three 
types of prior knowledge into the corresponding similar- 
ity matrices are described in details. Since in this work 
we are considering Affymetrix data, indexes i and / refer 
to probesets. What follows can be easily generalized to 
consider genes or proteins. Each subsection first 
describes the biological information and then illustrates 
the metrics used to generate the corresponding similar- 
ity matrix. 



Semantic similarity 

Gene Ontology (GO) is the most widely used annotation 
database that collects biological information on gene 
products. This controlled vocabulary consists of three 
independent categories: molecular function, biological 
process and cellular component [19]. GO terms are 
organized in a directed acyclic graph (DAG) in which 
each node corresponds to a GO term. Each node may 
have multiple parents: nodes farther from the root (high 
level nodes) correspond to more specialized terms, 
nodes closer to the root (low level nodes) to less specia- 
lized terms, thus implying that genes annotated with a 
specific node are also annotated with every ancestor of 
that node (true path rule). 

In this work molecular function and biological process 
GO annotations related to the probesets were downloaded 
from NetAffx database [39], while the DAG structure was 
extracted from the Bioconductor package GO.db. 

Semantic similarity was used to assess the degree of 
relatedness between two features by assigning a metric 
based on the likeness of the semantic content of their 
GO annotation. An information-theoretic method, based 
on the concept of Information Content {IC), was 
adopted [40]. 

The IC for the GO term t is defined as: 



IC(t) = - log 



freqjt) 
frecf(rooi) 



(4) 



i.e. the negative logarithm of the ratio between the fre- 
quency of the term t¥ in a corpus of annotations (i.e. 
the number of times the term t and each of its descen- 
dants occur in GO annotation) and the frequency of the 
root term (corresponding to the sum of the frequencies 
of all GO terms). The IC decreases monotonically when 
moving from the leaves toward the root node (IC - 0). 
The intuition behind the use of the IC is that the more 
probable a concept is, the less information it conveys. 

The Best-Match Average (BMA) approach [41] was 
used to calculate the semantic similarity scores s t j 
between two features i and ;: 



\GOi\ 



max ueGOj Sim Lin {t,u) + 



\GOj\ 



E max £eGOl Sim L/n (w / t) 



(5) 



where GO; and GO ; are the groups of GO terms t and 
u associated to the features i and respectively and 
Sim Lin (t, u) is the Lins similarity measure [42], which 
exploits the IC of the two GO terms t and u to generate 
normalized similarity measures in the range [0, 1], 
according to the following equation: 



Sim Lin (t,u) = 



2IC(MICA(t,u)) 
IC{t) + lC{u) 



(6) 
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where MICA indicates the most informative common 
ancestor. The BMA approach (Equation 5) is able to 
robustly assess the global similarity between two fea- 
tures also when they are annotated to a different num- 
ber of GO terms, since it considers both the GO terms 
they share and the GO terms in which the features dif- 
fer, but only the most similar ones are matched [22]. 
Topological similarity 

Topological information on PPI was extracted from 
HPRD [25]. This repository contains manually curated 
scientific information pertaining to the biology of most 
human proteins and is the database that includes most 
human protein-protein interactions, as shown in [43]. 
The 22207 features in our datasets were mapped into 
9521 proteins using RefSeq identifiers; this resulted in 
37080 interactions. Since different proteins can be asso- 
ciated to different probesets, the value of the similarity 
score Sij between probesets i or ; was obtained by aver- 
aging the similarity scores of the associated proteins: 



TFT £ max P^ I s (Pu Pi)} + j-jt 12 max Pi . ePi {s(p jf pi)} 

Mil Pi^Pi \Pj\ PjePj 



(7) 



where P t and Pj are the sets of proteins p t and pj 
annotated to the probesets i and y, respectively. 

Four topological measures were used to calculate the 
topological similarity scores s{p i} pj) between pairs of 
proteins p t and pf. 1) normalized geodesic distance; 2) 
Jaccard coefficient; 3) functional similarity; 4) a probabil- 
istic common neighborhood similarity. 

In order to describe the four topological similarity 
measures considered in this study, we first introduce 
some terms and notations. The network of the interac- 
tions is defined as graph G = (V, E) consisting of a set 
of nodes V and a set of edges E between them. p t and pj 
refer to proteins which are the nodes of the network; N 
(pi) and N(pj) are the neighbors of p t and pj respectively, 
and N{p b pj) = N(pt)n N(pj). 

Normalized geodesic distance The normalized geode- 
sic distance (NG) between two proteins p t and pj is 
defined as the normalized length of the shortest path, / 
{path{pt, pj)), from p t to pj , obtained by dividing {{path 
(fi> Pj)) by the maximum of the shortest paths between 
all pairs of proteins. The similarity s{p h pj) between 
two proteins is derived as 1 minus the normalized 
shortest path: 



s[Pi,Pj) 



I (path (pi, pj)) 



* ™^{ l ( path ( pk,pr W 

p k ,p r eV(G) 



(8) 



Jaccard coefficient The similarity measure s(p i> pj) 
based on the Jaccard coefficient (JA) [44] is defined as 
the ratio between the number of neighbors which two 



proteins share (common neighbors) and the total num- 
ber of proteins they are connected to: 



s(pupj) = 



Hpupj)\ 

\N(Pi)UN{Pj)\ 



(9) 



Functional similarity The functional similarity (FS) 
proposed in [23] measures the common neighborhood 
similarity of two proteins p t and pj in the network G, as: 



s(.Pi,Pj) - 



where 



2\N(Pi,Pj)\ 



2\N(p ilPj )\ 



HPi) ~ N(ft-) | + 2 \N( Pi , Pj ) | + Ay \N(Pj) - N( Pi ) | + 2 |N(fc, Pj ) \ + k fi 



(10) 



ky = max(0,n^ - (|N(p f ) - N(ft)| + 2 \N{p if pj)\)) (11) 

and n avg is the average number of neighbors of pro- 
teins in the network. The term A# penalizes the score 
between protein pairs where at least one of the proteins 
has too few neighbors. 

Probabilistic common neighborhood similarity A 
probabilistic measure for the statistical significance (SC) 
of the common neighborhood configuration of two pro- 
teins pi and pj has been recently proposed [24]. The 
measure is defined as the negative logarithm of the 
probability of p t and Pj having a certain number of com- 
mon neighbors by random chance: 

s{p uPj ) = -log 10 (prob (N, \N{ Pi )\ , |N(ft-)| , H^Pj)\)) (12) 

N is the total number of proteins in the network, and 
prob(N, \N(pi)\, \N(pj)\, \N(p u pj)\) is computed on the 
basis of the Hypergeometric distribution: 



prob{N,\N{p i )\,\N{p j )\ l \N{p i ,Pj)\) - 



HMPi)\>MPi)\) 

E 

k=H P uPi)\ 



,. )|} ^|N(p,)|^IN| 



1 V IMP;) | 



f |N| I 

{HPi)\J 



(13) 



Thus, the higher the probability (13), the higher the 
value of s(p b pj) is. 

Equations (8), (9), (10) and (12) are finally used to 
derive s t j using Equation (7). 
Correlation based similarity 

Publicly available data from ten breast cancer microar- 
ray studies were extracted from GEO, selecting those 
with a medium to large sample size (Table 2). Redun- 
dant subjects were removed. All datasets were hybri- 
dized using Affymetrix U133 Genechips™ (HG-U133A 
and HGU133plus2) and were analyzed using A-MAD- 
MAN, an open source web application, which allows the 
retrieval, annotation, organization and meta-analysis of 
gene expression [45]. In particular, the software enables 
the integrative analysis of data obtained from different 
Affymetrix platforms through meta- normalization. Affy- 
metrix chip definition files were used to annotate the 
arrays and gene expression intensity signal was 
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Table 2 Breast cancer datasets used for the correlation 



based similarity 



Datasets 


Platform 


Samples 


GSE2034 [52] 


HGU133A 


286 


GSE6532 [53] 


HGU133A / HGU133plus2 


225 


GSE11121 [54] 


HGU133A 


200 


GSE2990 [30] 


HGU133A 


189 


GSE1456 [55] 


HGU133A 


159 


GSE7390 [32] 


HGU133A 


155 


GSE5460 [56] 


HGU133plus2 


127 


GSE3494 [31] 


HGU133A 


110 


GSE5847 [57] 


HGU133A 


95 


GSE4922 [58] 


HGU133A 


40 



normalized using RMA algorithm [36]. The resulting 
gene expression matrix collects the expression levels of 
21921 probesets over 1586 biological samples. 

Gene expression profiles over the ten datasets were 
compared using similarity measures based on Pearson 
correlation coefficient (PE), Spearman rank correlation 
coefficient (SP) and Mutual Information (MI), which 
provide a general measure to analyze dependencies in 
gene expression data [46-48]. 

Using both the Pearson and the Spearman correlation, 
the similarity s t j between two probesets i and ; was 
defined as: 

Sij=\pij\ (14) 

To calculate Mutual Information we needed to quan- 
tize data on L intervals. There is no optimal solution to 
choose L, since it depends on data normalization and on 
the particular biological application [49]. As suggested in 
[47], heuristic lower/upper bounds on the number of 
intervals were considered [50,51]: MIi ow - [l+log 2 wj 
and MI up = Vra, where m is the number of expression 
values. In our case, L = 25. 

Evaluation of the biomarker lists 

Results were evaluated in terms of both stability of the 
biomarker lists obtained by the Canberra distance [7] 
and the accuracy performed by the perceptron classifier. 

Given two ordered lists Tl and T2 of p ranked fea- 
tures, the Canberra distance between them is defined as: 

Ca(Tl,T2) = f: ^ ( ; ) - T2 ^ (15) 

where Tj(i) and z 2 (i) indicate the rank, i.e. the position, 
of the feature i in the ordered lists Tl and T2, 
respectively. 

This measure is a weighted version of the Spearman's 
footrule which considers the variations in lower portions 



of the lists less relevant than those in the top [7]. A nor- 
malized version of this measure can be obtained by 
dividing the distance in (15) by its expected (average) 
value, approximated by (log(4) -l)p + log(4) - 2 for the 
complete lists. The normalized Canberra distance ranges 
between 0 (maximal stability) and 1.4 (maximal instabil- 
ity), with 1 in the case of randomly generated lists. 

The average number of iterations needed by the per- 
ceptron in the algorithm is also considered as a good 
indicator of the ratio between the maximal norm of 
transformed vectors and the margin one can obtain in 
feature space. We consider this value as a measure of 
how much difficult is the transformed task. 

Ranked feature lists obtained using different similarity 
matrices were evaluated both within datasets, i.e. com- 
paring the 1000 different lists obtained using Bootstrap, 
and between datasets, i.e. comparing the global lists 
obtained by analyzing datasets GSE2990, GSE3494 and 
GSE7390. For the within dataset comparison, the Can- 
berra distance was applied on the 1000 complete lists 
resulting from the Bootstrap resampling schema adopted 
by the classification algorithm. For the between datasets 
comparison, the Canberra distance was applied on the 
sublists of length /<, with k corresponding to the mini- 
mum Canberra distance within dataset (average of the 
three values obtained for the three datasets). Finally, for 
the best performing similarity matrices, the union of the 
sublists of length k obtained using the three datasets, 
where k ranges from 1 to the maximum number of fea- 
tures (22207), was considered in order to quantify the 
possible lack of consistency of the global lists. 

Results 

Within dataset assessment 

Table 3 reports the average normalized Canberra dis- 
tance and classification accuracy for all the three breast 
cancer datasets and for all similarity matrices. Results 
are reported for the cases where prior information is 
not used (a = 0) and using for each similarity matrix 
the value of a (Equation 2) which minimizes the Can- 
berra distance. For all the three datasets, all types of bio- 
logical information are able to decrease the average 
normalized Canberra distance over the biomarker lists 
with respect to the standard classification approach. In 
particular, three types of prior knowledge are best per- 
formers in this task: Gene Ontology Biological Process 
(GO BP), Gene Ontology Molecular Function (GO MF) 
and protein-protein interactions codified by the normal- 
ized geodesic distance (PPI NG). For these three types 
of biological knowledge, the improvement in list stabi- 
lity, which ranges between 26% and 37%, is achieved 
without a corresponding loss in accuracy since this lat- 
ter changes in a range between minus 2% to plus 3%. 



Sanavia et al. BMC Bioinformotics 2012, 13(Suppl 4):S22 
http://www.biomedcentral.com/l471-2105/13/S4/S22 



Page 8 of 1 1 



Table 3 Classification performance within breast cancer datasets 




GSE2990 


GSE3494 


GSE7390 


No prior 


0.89 (95%) 


0.93 (93%) 


0.90 (98%) 




7 


1 n 
I u 


O 


GO BP 


0.62 (93%) 


0.63 (95%) 


0.60 (96%) 




15 


21 


13 


GO MF 


0.63 (93%) 


0.68 (94%) 


0.60 (97%) 




1 7 
I / 


1A 


1 ^ 
i j 


PPI NG 


0.57 (94%) 


0.58 (96%) 


0.53 (97%) 




1 n 
I u 


1 A 
I '4- 


Q 

y 


PPI JA 


0.87 (95%) 


0.91 (93%) 


0.87 (97%) 




7 
/ 


1 1 
i i 


7 


PPI 


U.OO \yJ /0) 


U.yZ \yj /0) 


D 99 (Q70/M 

U.OO ^/ /OJ 




7 


11 


7 


PPI sc 


0.83 (95%) 
8 


0.86 (95%) 
13 


0.83 (96%) 
8 


PE 


0.78 (95%) 


0.89 (96%) 


0.79 (96%) 




49 


56 


37 


SP 


0.78 (95%) 


0.89 (95%) 


0.79 (95%) 




48 


56 


38 


Ml 


0.76 (91%) 


0.80 (94%) 


0.73 (94%) 




130 


207 


131 



Normalized Canberra distance between feature lists obtained for datasets GSE2990, GSE3494 and GSE7390, using the standard classification approach without 
prior knowledge integration and different prior knowledge based similarity matrices: Gene Ontology Biological Process (GO BP), Gene Ontology Molecular 
Function (GO MF), protein-protein interactions codified by the normalized geodesic distance (PPI NG), the Jaccard coefficient (PPI JA), the functional similarity (PPI 
FS), the probabilistic common neighborhood similarity (PPI SC), the Pearson correlation (PE), the Spearman rank correlation (SP) and the Mutual Information (Ml). 
Predictive accuracy is indicated in brackets, whereas the number of iterations obtained by the classifier is reported below the other scores. 



Table 3 also reports the number of iterations needed 
by the classification algorithm to reach convergence, 
averaged across the 1000 Bootstrap splits. Compared to 
other types of prior knowledge, the higher number of 
iterations are observed with the correlation (PE and SP) 
and Mutual Information (MI) based matrices, whereas 
PPI measures lead the classifier to reach convergence 
with a lower number of iterations, i.e. they improve 
class separability. However, except the normalized geo- 
desic distance, all the other protein-protein interaction 
measures show the lowest gain in reproducibility. 

Between datasets assessment 

Table 4 reports the average Canberra distance obtained 
by comparing datasets GSE2990 vs GSE3494, GSE2990 
vs GSE7390, GSE3494 vs GSE7390, and the resulting 
average Canberra distance together with the average clas- 
sification accuracy across the three datasets for k corre- 
sponding to the minimum Canberra distance within 
datasets (average of the three values obtained for the 
three datasets). GO BP, GO MF and PPI NG are con- 
firmed as the best performing kinds of prior knowledge. 
In addition, MI based similarity matrix shows perfor- 
mance comparable to the former similarity matrices. 

In order to better assess the improvement highlighted 
in these four similarity matrices, we have looked at the 
size of the union sets of the biomarker lists of length k 
over all the three datasets (Figure 1). The more two lists 
are similar, i.e. containing the same features, the more 



the points of the curve are drawn near the diagonal. 
Compared with the standard approach, the union lists 
obtained from GO BP, GO MF and PPI NG are able to 
improve the feature ranking, but no meaningful 
improvements are evident for the similarity matrix 
obtained using MI similarity matrix. In particular, the 
two GO BP and GO MF based matrices provide the 



Table 4 Canberra distance and accuracy across breast 
cancer datasets 





k 


GSE2990 
vs 

GSE3494 


GSE3494 
vs 

GSE7390 


GSE2990 
vs 

GSE7390 


Mean 
Canberra 
Distance 


Mean 
Accuracy 


No 


4182 


0.95 


0.94 


0.94 


0.94 


95% 


prior 














GO BP 


4268 


0.63 


0.65 


0.65 


0.65 


95% 


GO MF 


3456 


0.62 


0.62 


0.63 


0.63 


94% 


PPI NG 


8684 


0.62 


0.61 


0.62 


0.62 


96% 


PPI JA 


22207 


0.96 


0.96 


0.97 


0.97 


95% 


PPI FS 


22207 


0.96 


0.97 


0.97 


0.97 


96% 


PPI SC 


22207 


0.91 


0.92 


0.93 


0.93 


95% 


PE 


128 


0.70 


0.72 


0.74 


0.74 


96% 


SP 


163 


0.68 


0.71 


0.62 


0.62 


95% 


Ml 


310 


0.62 


0.65 


0.64 


0.64 


93% 



Pair-wise Canberra distance between the three breast cancer datasets at 
different number of features selected according to the minimum Canberra 
distance within datasets, using the standard classification approach without 
prior knowledge integration and different prior knowledge based similarity 
matrices. The corresponding mean value and the mean accuracy obtained 
across the three datasets are also reported. 
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k 

Figure 1 Feature list stability. Number of features in the union lists of length k, obtained by the standard classifier (No prior) and the 
integration of the best performing biological information: GO Biological Process (GO BP), GO Molecular Function (GO MF), protein-protein 
interactions codified by the normalized geodesic distance (PPI NG) and mutual information for gene expression data (Ml). 



most stable union lists for k around 5000 features, 
whereas PPI NG matrix achieves the best performance 
for k around 9000 features. 

Discussion 

The subject of the investigation in this paper is the effect 
of using information from the biological domain into a 
learning process with the aim of improving its general per- 
formance with respect to the stability of predicted biomar- 
kers. State-of-the-art machine learning methods give 
solutions with empirically good performance in terms of 
accuracy. However, the stability of the selected biomarkers 
is also a very important issue. If an accurate system tends 
to select the same biomarkers in different independent 
experiments, then it is more likely that the selected bio- 
markers are the right ones. 

In this work, we have integrated gene expression data 
and biological prior knowledge to enhance biomarker lists 
stability in a classification approach. In particular, we have 
compared the effect of incorporating different types of bio- 
logical prior knowledge, like functional annotations, pro- 
tein-protein interactions and expression correlation 
among genes in the learning process by evaluating biomar- 
ker list stability and classification accuracy. 

Integrating prior knowledge is not an easy task since 
different types of information are represented in various 
data formats and stored in heterogeneous data structures. 
To do that, we have codified biological information into 



specific pair-wise similarity measures, chosen accordingly 
to the type of biological information used: semantic simi- 
larities for the annotations on GO, topology-based simi- 
larity measures for PPI and correlation for gene 
expression data. Feature space has then been mapped 
into a new space in which the more similar two features 
are, the more closely they are mapped. Our intuition is 
that when some features are strongly correlated to each 
other, for example because they belong to the same bio- 
logical process, then they likely have similar importance 
and are equally relevant for the task at hand. In other 
words, the weight vector obtained for a classification task 
should have similar values on indices relative to similar 
genes. Following this intuition, we can bias the solutions 
to fulfill this property. Experimental results seem to sup- 
port this intuition: our approach improves list stability, 
preserving high classification accuracy. In particular, 
three similarity matrices, based on GO BP, GO MF anno- 
tations and PPI NG, are the best performers in improving 
list stability. The lowest gain in biomarker list reproduci- 
bility is observed with the other matrices based on pro- 
tein-protein interaction networks, whereas those based 
on correlation and mutual information achieve a better 
reproducibility but lead the classifier to reach conver- 
gence with a higher number of iterations. 

In particular, the MI based matrix shows performance 
comparable to GO BP, GO MF and PPI NG based 
matrices when list stability is assessed between datasets. 
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This work compares the use of different biological 
information from genomic databases in the learning 
process. The technique proposed in this paper builds a 
kernel matrix from a similarity matrix, thus it can be 
used together with any kernel method (see [38] and 
references therein for a survey). In particular, it provides 
a standardized way to incorporate different types of bio- 
logical knowledge in the kernel, with the only constraint 
that the information is codified by a similarity matrix. 

Obtained results provide a starting point for additional 
experiments. As future work, we think it would be inter- 
esting to combine similarity matrices in order to obtain 
even more stable biomarkers, using for example the 
approach proposed by Bie et al. [28] to combine kernels. 
We believe that the power and potential of the proposed 
strategy will increase as the coverage and quality of bio- 
logical databases improve. 

List of abbreviations used 

Bp: Bayes point; AVE: average operation; ACC: prediction accuracy; DAG: 
direct acyclic graph; IC: Information Content; BMA: Best-Match Average 
approach; MICA: most informative common ancestor; NG: normalized 
geodesic distance; JA: Jaccard coefficient; FS: functional similarity; SC: 
probabilistic common neighborhood similarity; PE: Pearson correlation 
coefficient; SP: Spearman rank correlation; Ml: Mutual Information. 
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